Abstract. We present a connectionist architecture that can learn a model of the relations 
between perceptions and actions and use this model for behavior planning. State representa- 
tions are learned with a growing self-organizing layer which is directly coupled to a perception 
and a motor layer. Knowledge about possible state transitions is encoded in the lateral con- 
nectivity. Motor signals modulate this lateral connectivity and a dynamic field on the layer 
organizes a planning process. All mechanisms are local and adaptation is based on Hebbian 
ideas. The model is continuous in the action, perception, and time domain. 



1 Introduction 

Planning of behavior requires some knowledge about the consequences of actions in a given 
environment. A world model captures such knowledge. It seems that the brain is capable of 
planning in a way that involves a simulation of actions and their perceptual consequences 
(see, e.g., Hesslow's [1] arguments for a simulation theory of cognitive brain function). How- 
ever, the level of abstraction, the representation, on which such simulation occurs is hardly 
the level of physical coordinates. A tempting hypothesis is that the representations the 
brain uses for reasoning and planning are particularly designed (by adaptation or evolu- 
tion) for just this purpose. To address such ideas we first need a basic model for how a 
connectionist architecture can encode a world model and how self-organization of inherent 
representations is possible. 

In the field of machine learning, world models are a standard approach to handle behavior 
organization problems (for a comparison of model-based approaches to the classical, model- 
free Reinforcement Learning see [2]). Our approach for a connectionist world model (CWM) 
builds on the classical notions of a world model and is functionally similar to existing Machine 
Learning approaches with self-organizing state space models [3, 4]. It is able to grow neural 
representations for different world states and to learn the consequences of actions in terms 
of state transitions. It differs, though, from classical approaches in some crucial points: 

• The model is continuous in the action, the perception, as well as the time domain. 

• All mechanisms are based on local interactions. The adaptation mechanisms are largely 
derived from the idea of Hebbian plasticity. E.g., the lateral connectivity, which encodes 
knowledge about possible state transitions, is adapted by a variant of the temporal Hebb 
rule and allows local adaptation of the world model to local world changes. 

• The coupling to the motor system is fully integrated in the architecture via a mechanism 
incorporating modulating synapses (comparable to shunting mechanisms). 

• The two dynamic processes on the CWM, the "tracking" process estimating the current 
state and the planning process (similar to Dynamic Programming), are realized by 
activation dynamics on the architecture, incorporating in particular lateral interactions, 
inspired by neural fields [5]. 

Figure 1: Schema of the CWM architecture. 

The outline of the paper is as follows: In the next section we describe our architecture, the 
dynamics of activation and the couplings to perception and motor layers. In section 3 we 
introduce a dynamic process that generates, as an attractor, a value field over the layer 
which is comparable to a state value function estimating the expected future return and 
allows for goal-oriented behavior organization. The self-organization process and adaptation 
mechanisms are described in section 4. We demonstrate the features of the model on a maze 
problem in section 5 and finally discuss the results and the model in general terms. 



2 The model 

The core of the connectionist world model (CWM) is a neural layer which is coupled to 
a perceptual layer and a motor layer, see figure 1. Let us enumerate the units of the 
central layer by $i = 1, \dots, N$. Lateral connections within the layer may exist and we denote a 
connection from the $i$-th to the $j$-th unit by $(ji)$. E.g., "$\sum_{(ji)}$" means "summing over all existing 
connections". To every unit we associate an activation $x_j \in \mathbb{R}$ which is governed by the 
dynamics 

$$\tau_x\, \dot{x}_j = -x_j + k_s(s_j, s) + \eta \sum_{(ji)} k_a(a_{ji}, a)\, w_{ji}\, x_i , \tag{1}$$

which we will explain in detail in the following. First of all, $x_i$ are the time-dependent 
activations and the dot-notation $\tau_x\, \dot{x} = F(x)$ means a time derivative which we algorithmically 
implemented by an Euler integration step $x(t) = x(t-1) + \frac{1}{\tau_x} F(x(t-1))$. 

The first term in (1) induces an exponential relaxation while the second and third terms are 
the inputs. $k_s(s_j, s)$ is the forward excitation that unit $j$ receives from the perceptive layer. 
Here, $s_j$ is the codebook vector (receptive field) of unit $j$ onto the perception layer which 
is compared to the current stimulus $s$ via the kernel function $k_s$. We will choose Gaussian 
kernels, as is the case, e.g., for typical radial basis function networks. 

The third term, $\eta \sum_{(ji)} k_a(a_{ji}, a)\, w_{ji}\, x_i$, describes the lateral interaction on the central layer. 
Namely, unit $j$ receives lateral input from unit $i$ iff there exists a connection $(ji)$ from $i$ to 
$j$. This lateral input is weighted by the connection's synaptic strength $w_{ji}$. Additionally 
there is another term entering multiplicatively into this lateral interaction: Lateral inputs 
are modulated depending on the current motor activation. We chose a modulation of the 
following kind: To every existing connection $(ji)$ we associate a codebook vector $a_{ji}$ onto 
the motor layer which is compared to the current motor activation $a$ via a Gaussian kernel 
function $k_a$. Due to the multiplicative coupling, a connection contributes to lateral inputs 
only when the current motor activation "matches" the codebook vector of this connection. 
The modulation of information transmission by multiplicative or divisive interactions is a 
fundamental principle in biological neural systems [6]. One example is shunting inhibition 
where inhibitory synapses attach to regions of the dendritic tree near to the soma and thereby 
modulate the transmission of the dendritic input [7]. In our architecture, a shunting synapse, 
receiving input from the motor layer, might attach to only one branch of a (lateral) dendritic 
tree and thereby multiplicatively modulate the lateral inputs summed up at this subtree. 
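
As an illustration, one Euler step of equation (1) might be sketched as follows. This is a minimal sketch under stated assumptions, not the original implementation: the function names `gauss` and `euler_step`, the dictionary layout of the connections, the use of unnormalized kernels, and the default parameter values are all chosen here for illustration only.

```python
import math

def gauss(c, x, two_sigma_sq):
    """Unnormalized Gaussian kernel exp(-|c - x|^2 / (2 sigma^2)) (assumed form)."""
    d2 = sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return math.exp(-d2 / two_sigma_sq)

def euler_step(x, s_code, conns, s, a, tau_x=2.0, eta=0.1,
               two_sigma_s_sq=0.01, two_sigma_a_sq=0.5):
    """One Euler step of eq. (1).

    x      : list of activations x_i
    s_code : list of perceptual codebook vectors s_j
    conns  : dict {(j, i): (a_ji, w_ji)} for every existing connection i -> j
    s, a   : current stimulus and current motor activation
    """
    lateral = [0.0] * len(x)
    for (j, i), (a_ji, w_ji) in conns.items():
        # The motor activation gates the lateral input multiplicatively.
        lateral[j] += gauss(a_ji, a, two_sigma_a_sq) * w_ji * x[i]
    new_x = []
    for j in range(len(x)):
        F = -x[j] + gauss(s_code[j], s, two_sigma_s_sq) + eta * lateral[j]
        new_x.append(x[j] + F / tau_x)
    return new_x

# Two units; unit 0 is active and connected to unit 1 by a "move right" connection.
s_code = [(0.0, 0.0), (1.0, 0.0)]
conns = {(1, 0): ((1.0, 0.0), 1.0)}
x_match = euler_step([1.0, 0.0], s_code, conns, s=(0.0, 0.0), a=(1.0, 0.0))
x_mismatch = euler_step([1.0, 0.0], s_code, conns, s=(0.0, 0.0), a=(-1.0, 0.0))
```

In this toy setting, unit 1 receives predictive lateral excitation only when the motor activation matches the codebook vector of the connection; with the opposite motor activation the lateral input is almost entirely suppressed.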

For the following it is helpful to briefly discuss a certain relation between equation (1) 
and a classical probabilistic approach. Let us assume normalized kernel functions 

$$k_s(s_j, s) = \frac{1}{\sqrt{2\pi}\,\sigma_s} \exp\left(-\frac{(s_j - s)^2}{2\sigma_s^2}\right) , \qquad k_a(a_{ji}, a) = \frac{1}{\sqrt{2\pi}\,\sigma_a} \exp\left(-\frac{(a_{ji} - a)^2}{2\sigma_a^2}\right) .$$

These kernel functions can directly be interpreted as probabilities: $k_s(s_j, s)$ represents the 
probability $P(s|j)$ that the stimulus is $s$ if $j$ is active, and $k_a(a_{ji}, a)$ the probability $P(a|j,i)$ 
that the action is $a$ if a transition $i \to j$ occurred. As for typical hidden Markov models we 
may derive the prior probability distribution $P(j|a)$, given the action: 

$$P(j|a,i) = \frac{P(a|j,i)\, P(j|i)}{P(a|i)} = k_a(a_{ji}, a)\, \frac{P(j|i)}{P(a|i)} ,$$

$$P(j|a) = \sum_i k_a(a_{ji}, a)\, \frac{P(j|i)}{P(a|i)}\, P(i) .$$

$P(a|i)$ can be computed by normalizing $P(a|j,i)\, P(j|i)$ over $j$ such that $\sum_j P(j|a,i) = 1$. 
What we would like to mention here is that in equation (1), the lateral input 
$\eta \sum_{(ji)} k_a(a_{ji}, a)\, w_{ji}\, x_i$ can be compared to the prior $P(j|a)$ under the assumption that $x_i$ is 
proportional to $P(i)$ and if we have an adaptation mechanism for $w_{ji}$ which converges to a 
value proportional to $P(j|i)$ and which also ensures normalization, i.e., $\sum_j k_a(a_{ji}, a)\, w_{ji} = 1$ 
for all $i$ and $a$. This insight will help to judge some details of the next two sections. The 
probabilistic interpretation can be further exploited, e.g., comparing the input of a unit $j$ (or, 
in the quasi-stationary case, $x_j$ itself) to the posterior and deriving theoretically grounded 
adaptation mechanisms. But this is not within the scope of this paper. 
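
The normalization over $j$ described above can be made concrete with a small sketch. It assumes a one-dimensional action and a fixed predecessor unit $i$; the names `k_a` and `transition_posterior`, the dictionary layout, and the value of `sigma_a` are illustrative assumptions rather than part of the model definition.

```python
import math

def k_a(a_ji, a, sigma_a=0.5):
    """Normalized 1-D Gaussian kernel, read as P(a | j, i)."""
    norm = math.sqrt(2.0 * math.pi) * sigma_a
    return math.exp(-(a_ji - a) ** 2 / (2.0 * sigma_a ** 2)) / norm

def transition_posterior(a, codebooks, p_j_given_i):
    """Return P(j | a, i) for all successors j of one fixed unit i.

    codebooks   : dict {j: a_ji}, motor codebook of each connection i -> j
    p_j_given_i : dict {j: P(j | i)}, action-independent transition prior
    """
    joint = {j: k_a(codebooks[j], a) * p_j_given_i[j] for j in codebooks}
    p_a_given_i = sum(joint.values())  # P(a | i) acts as the normalizer
    return {j: value / p_a_given_i for j, value in joint.items()}

# Unit i has two successors, reached by opposite actions with equal prior.
post = transition_posterior(a=1.0,
                            codebooks={1: 1.0, 2: -1.0},
                            p_j_given_i={1: 0.5, 2: 0.5})
```

Here the action matches the codebook of the connection to successor 1, so nearly all probability mass of `post` is assigned to that successor.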



3 The dynamics of planning 

To organize goal-oriented behavior we assume that, in parallel to the activation dynamics 
(1), there exists a second dynamic process which can be motivated from classical approaches 
to Reinforcement Learning [8, 9]. Recall the Bellman equation 

$$V^\pi_*(i) = \sum_a \pi(a|i) \sum_j P(j|i,a)\, \big[ r(j) + \gamma\, V^\pi_*(j) \big] , \tag{2}$$

which follows from the expectation $V^\pi_*(i)$ of the discounted future return $R(t) = \sum_{\tau=1}^{\infty} \gamma^{\tau-1} \varrho(t+\tau)$, 
which satisfies $R(t) = \varrho(t+1) + \gamma R(t+1)$, when situated in state $i$. Here, $\gamma$ is the discount 
factor and we presumed that the received rewards $\varrho(t)$ actually depend only on the state 
and thus enter equation (2) only in terms of the reward function $r(i)$ (we neglect here that 
rewards may directly depend on the action). Behavior is described by a stochastic policy 
$\pi(a|i)$, the probability of executing action $a$ in state $i$. Knowing the property (2) of $V^\pi_*$ it is 
straightforward to define a recursion algorithm for an approximation $V^\pi$ of $V^\pi_*$ such that $V^\pi$ 
converges to $V^\pi_*$. This recursion algorithm is called Value Iteration and reads 

$$\tau_v\, \Delta V^\pi(i) = -V^\pi(i) + \sum_a \pi(a|i) \sum_j P(j|i,a)\, \big[ r(j) + \gamma\, V^\pi(j) \big] , \tag{3}$$

with a "reciprocal learning rate" or time constant $\tau_v$. Note that (2) is the fixed point 
equation of (3). 

The practical meaning of the state-value function V is that it quantifies how desirable and 
promising it is to reach a state i, also accounting for future rewards to be expected. In 
particular, if one knows the current state i it is a simple and efficient rule of behavior to 
choose that action a that will lead to the neighbor state j with maximal V(j) (the greedy 
policy). In that sense, V(i) provides a smooth gradient towards desirable goals. Note though 
that direct Value Iteration presumes that the state and action spaces are known and finite, 
and that the current state and the world model $P(j|i,a)$ are known. 
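
For reference, the relaxation form of Value Iteration in equation (3) can be sketched for a small, fully known, finite problem. The nested-list layout of `P` and `policy`, the function name `value_iteration`, and the three-state chain example are assumptions made for this illustration.

```python
def value_iteration(P, r, policy, gamma=0.8, tau_v=2.0, steps=200):
    """Relaxation form of Value Iteration, eq. (3).

    P[i][a][j]   : transition probability P(j | i, a)
    r[j]         : reward of state j
    policy[i][a] : probability pi(a | i)
    """
    n = len(r)
    V = [0.0] * n
    for _ in range(steps):
        target = []
        for i in range(n):
            expected = 0.0
            for a in range(len(policy[i])):
                backup = sum(P[i][a][j] * (r[j] + gamma * V[j]) for j in range(n))
                expected += policy[i][a] * backup
            target.append(expected)
        # Delta V = (target - V) / tau_v, applied synchronously to all states.
        V = [V[i] + (target[i] - V[i]) / tau_v for i in range(n)]
    return V

# Deterministic chain 0 -> 1 -> 2 with an absorbing, rewarded state 2 and a single action.
P = [[[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]], [[0.0, 0.0, 1.0]]]
r = [0.0, 0.0, 1.0]
policy = [[1.0], [1.0], [1.0]]
V = value_iteration(P, r, policy)
```

In this example the fixed point of equation (2) can be checked by hand: $V(2) = 1 + 0.8\,V(2)$ gives $V(2) = 5$, and hence $V(1) = 5$ and $V(0) = 4$.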

How can we transfer these classical ideas to our model? We suppose that the CWM is given 
a goal stimulus $g$ from outside, i.e., it is given the command to reach a world state that 
corresponds to the stimulus $g$. This stimulus induces a reward excitation $r_i = k_s(s_i, g)$ for 
each unit $i$. Now, besides the activations $x_i$, we introduce another field over the CWM, the 
value field $v_i$, which is in analogy to the state-value function $V(i)$. The dynamics is 

$$\tau_v\, \dot{v}_i = -v_i + r_i + \gamma \max_{(ji)} (w_{ji}\, v_j) , \tag{4}$$

and well comparable to (3): A slight difference is that $v_i$ estimates the "current-plus-future" 
reward $\varrho(t) + \gamma R(t)$ rather than the future reward only; in the notation above 
this corresponds to the slightly modified value iteration $\tau_v\, \Delta V^\pi(i) = -V^\pi(i) + r(i) + 
\sum_a \pi(a|i) \sum_j P(j|i,a)\, [\gamma\, V^\pi(j)]$. As is commonly done for Value Iteration, we assumed $\pi$ 
to be the greedy policy. More precisely, we considered only that action (i.e., that connection 
$(ji)$) that leads to the neighbor state $j$ with maximal value $w_{ji}\, v_j$. In effect, the summations 
over $a$ as well as over $j$ can be replaced by a maximization over $(ji)$. Finally we replaced 
the probability factor $P(j|i,a)$ by $w_{ji}$; we will see in the next section how $w_{ji}$ is learned 
and what it will converge to. 

In practice, the value field will relax quickly to its fixed point $v_i^* = r_i + \gamma \max_{(ji)} (w_{ji}\, v_j^*)$ 
and stay there if the goal does not change and if the world model is not re-adapted (see the 
experiments). The quasi-stationary value field $v_i$ together with the current (typically 
non-stationary) activations $x_i$ allow the system to generate a motor signal that guides towards 
the goal. More precisely, the value field $v_i$ determines for every unit $i$ the "best" neighbor 
unit $k_i = \operatorname{argmax}_j\, w_{ji}\, v_j$. The output motor signal is then the activation average 

$$a = \sum_i x_i\, a_{k_i i} \tag{5}$$

of the motor codebook vectors $a_{k_i i}$ that have been learned for the corresponding connections. 
Hence, the information flow between the central layer and the motor system goes both ways: 
In the "tracking" process as given by equation (1) the information flows from the motor 
layer to the central layer: Motor signals activate the corresponding connections and cause 
lateral, predictive excitations. In the action selection process as given by equation (5) the 
signals flow from the central layer back to the motor layer to induce the motor activations 
that should turn predictions into reality. 

Depending on the specific problem and the motor system, a post-processing of the motor 
signal a, e.g. a competition between contradictory motor units, might be necessary. In our 
experiments we will have two motor units and will always normalize the 2D vector a to unit 
length. 
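
The planning dynamics (4) and the motor read-out (5) can be sketched together as follows. This is only an illustration under assumptions: the names `relax_value_field` and `motor_signal`, the dictionary layout, and the treatment of a unit without outgoing connections (its maximum is taken as 0, which presumes nonnegative weights and rewards) are choices made here, not details given in the text.

```python
import math

def relax_value_field(r, conns, gamma=0.8, tau_v=2.0, steps=200):
    """Euler relaxation of eq. (4); conns maps (j, i) -> w_ji for a connection i -> j."""
    v = [0.0] * len(r)
    for _ in range(steps):
        best = [0.0] * len(r)  # assumed: empty maximum counts as 0
        for (j, i), w in conns.items():
            best[i] = max(best[i], w * v[j])
        v = [v[i] + (-v[i] + r[i] + gamma * best[i]) / tau_v for i in range(len(r))]
    return v

def motor_signal(x, v, conns, codebooks):
    """Eq. (5): sum of x_i * a_{k_i i}, normalized to unit length (2-D motor layer)."""
    a = [0.0, 0.0]
    for i in range(len(x)):
        successors = [(w * v[j], j) for (j, src), w in conns.items() if src == i]
        if not successors:
            continue
        _, k_i = max(successors)  # best neighbor unit of i
        a = [a[d] + x[i] * codebooks[(k_i, i)][d] for d in range(2)]
    norm = math.hypot(a[0], a[1])
    return a if norm == 0.0 else [c / norm for c in a]

# Chain of three units, connected in both directions, goal at unit 2.
conns = {(1, 0): 1.0, (2, 1): 1.0, (0, 1): 1.0, (1, 2): 1.0}
codebooks = {(1, 0): (1.0, 0.0), (2, 1): (1.0, 0.0),
             (0, 1): (-1.0, 0.0), (1, 2): (-1.0, 0.0)}
v = relax_value_field([0.0, 0.0, 1.0], conns)
a = motor_signal([0.0, 1.0, 0.0], v, conns, codebooks)
```

In this toy chain the value field increases monotonically towards the goal unit, and an agent localized at the middle unit receives a motor signal pointing along the connection to the goal.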



4 Self-organization and adaptation 

The self-organization process of the central layer combines techniques from standard 
self-organizing maps [10, 11] and their extensions w.r.t. growing representations [12, 13] and the 
learning of temporal dependencies in lateral connections [14, 15, 16]. The free variables of a 
CWM subject to adaptation are (1) the number of neurons and the lateral connectivity itself, 
(2) the codebook vectors $s_j$ and $a_{ji}$ to the perceptive and motor layers, respectively, and 
(3) the weights $w_{ji}$ of the lateral connections. The adaptation mechanisms we propose are 
based on three general principles: (1) the addition of units for representation of novel states 
(novelty), (2) the fine tuning of the codebook vectors of units and connections (plasticity), 
and (3) the adaptation of lateral connections in favor of better prediction performance 
(prediction). 

Novelty. Mechanisms similar to those of FuzzyARTMAPs [12] or Growing Neural Gas [13] 
account for the insertion of new units when novelty is detected. We detect novelty in a 
straightforward manner, namely when the difference between the actual perception and the 
best matching unit becomes too large. To make this detection more robust, we use a low-pass 
filter (leaky integrator). At a given time, let $z$ be the best matching unit, $z = \operatorname{argmax}_j\, x_j$. 
For this unit we integrate the error measure $e_z$ 

$$\tau_e\, \dot{e}_z = -e_z + \big(1 - k_s(s_z, s)\big) .$$

We normalize $k_s(s_z, s)$ such that it equals 1 in the perfect matching case when $s_z = s$. 
Whenever this error measure exceeds a threshold called vigilance, $e_z > \nu$, $\nu \in [0,1]$, we 
generate a new unit $j$ with the codebook vector equal to the current perception, $s_j = s$, 
and a connection from the last best matching unit $z'$ with the codebook vector equal to the 
current motor signal, $a_{jz'} = a$. The errors of both the new and the old unit are reset to 
zero, $e_z \leftarrow 0$, $e_j \leftarrow 0$. 
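
A minimal sketch of the leaky-integrator novelty test might look as follows. The function name `novelty_step`, the unit Euler time step, the kernel width, and in particular the vigilance value of 0.5 are assumptions for illustration; the text does not state the vigilance that was used.

```python
import math

def novelty_step(e_z, s_z, s, tau_e=10.0, two_sigma_s_sq=0.01, vigilance=0.5):
    """One Euler step of the error integrator of the best matching unit z.

    Returns (new_error, is_novel). The kernel is scaled so that a perfect
    match s_z == s gives exactly 1.
    """
    d2 = sum((u - w) ** 2 for u, w in zip(s_z, s))
    match = math.exp(-d2 / two_sigma_s_sq)
    e_z = e_z + (-e_z + (1.0 - match)) / tau_e
    return e_z, e_z > vigilance

# A stimulus far from the codebook vector: count steps until novelty is flagged.
e, novel, steps = 0.0, False, 0
while not novel:
    e, novel = novelty_step(e, s_z=(0.0, 0.0), s=(1.0, 0.0))
    steps += 1

# A perfectly matching stimulus never accumulates error.
e_same, novel_same = novelty_step(0.0, s_z=(0.0, 0.0), s=(0.0, 0.0))
```

Because the error is low-pass filtered, a single badly matching stimulus does not trigger the insertion of a unit; with the assumed settings the mismatch has to persist for several time steps.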

Plasticity. We use simple Hebbian plasticity to fine tune the representations of existing 
units and connections. Over time, the receptive fields of units and connections become more 
and more similar to the average stimuli that activated them. We use the update rules 

$$\tau_s\, \dot{s}_z = -s_z + s , \qquad \tau_a\, \dot{a}_{zz'} = -a_{zz'} + a ,$$

with learning time constants $\tau_s$ and $\tau_a$. 
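
Both plasticity rules have the same leaky form, so a single helper suffices to sketch them. The name `hebbian_codebook_step` and the unit Euler time step are assumptions for this illustration.

```python
def hebbian_codebook_step(codebook, target, tau):
    """Euler step of tau * d(codebook)/dt = -codebook + target."""
    return [c + (t - c) / tau for c, t in zip(codebook, target)]

# The receptive field of the best matching unit drifts towards a constant stimulus.
s_z = [0.0, 0.0]
first = hebbian_codebook_step(s_z, [1.0, 1.0], tau=20.0)
for _ in range(200):
    s_z = hebbian_codebook_step(s_z, [1.0, 1.0], tau=20.0)
```

Under a constant stimulus the codebook vector converges exponentially to it; under a varying stimulus it tracks a running average over a window of roughly `tau` steps.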

Prediction and a temporal Hebb rule. Although perfect prediction is not the actual 
objective of the CWM, the predictive power is a measure of the correctness of the learned 
world model, and good predictive power goes hand in hand with good behavior planning. The first 
and simple mechanism to adapt the predictive power is to grow a new lateral connection 
between two successive best matching units $z'$ and $z$ if it does not yet exist. The new 
connection is initialized with $w_{zz'} = 1$ and $a_{zz'} = a$. The second, more interesting mechanism 
addresses the adaptation of $w_{ji}$ based on new experiences and can be motivated as follows: 
The temporal Hebb rule strengthens a synapse if the pre- and post-synaptic neurons spike 
in sequence, depending on the inter-spike interval, and is supposed to roughly describe LTP 
and LTD (see, e.g., [17]). In a population code model, this corresponds to a measure of 
correlation between the pre-synaptic and the delayed post-synaptic activity. In our case we 
additionally have to account for the action-dependence of a lateral connection. We do so by 
considering the term $k_a(a_{ji}, a)\, x_i$ instead of only the pre-synaptic activity. As a measure of 
temporal correlation we choose to relate this term to the derivative $\dot{x}_j$ of the post-synaptic 
unit instead of its delayed activation; this saves us from specifying an ad-hoc "typical" 
delay and directly reflects that, in equation (1), lateral inputs relate to the derivative of $x_j$. 
Hence, we consider the product $\dot{x}_j\, k_a(a_{ji}, a)\, x_i$ as the measure of correlation. Our concrete 
implementation is a robust version of this idea: 

$$\tau_w\, \dot{w}_{ji} = \kappa_{ji}\, [c_{ji} - w_{ji}\, \kappa_{ji}] , \quad \text{where}$$

$$\tau_\kappa\, \dot{c}_{ji} = -c_{ji} + \dot{x}_j\, k_a(a_{ji}, a)\, x_i , \qquad \tau_\kappa\, \dot{\kappa}_{ji} = -\kappa_{ji} + k_a(a_{ji}, a)\, x_i .$$

Here, $c_{ji}$ and $\kappa_{ji}$ are simply low-pass filters of $\dot{x}_j\, k_a(a_{ji}, a)\, x_i$ and of $k_a(a_{ji}, a)\, x_i$. The term 
$w_{ji}\, \kappa_{ji}$ ensures convergence (assuming quasi-static $c_{ji}$ and $\kappa_{ji}$) of $w_{ji}$ towards $c_{ji} / \kappa_{ji}$. The 
time scale of adaptation is modulated by the recent activity $\kappa_{ji}$ of the connection. 
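
The weight adaptation for a single connection can be sketched as below. The function name `hebb_step`, the unit Euler time step, and the constant inputs in the two example runs are assumptions; in the full model the derivative of the post-synaptic activation and the gated pre-synaptic activity would come from the activation dynamics.

```python
def hebb_step(w, c, kappa, dx_j, gated_pre, tau_w=10.0, tau_kappa=100.0):
    """One Euler step of the temporal Hebb rule for one connection (ji).

    dx_j      : time derivative of the post-synaptic activation x_j
    gated_pre : k_a(a_ji, a) * x_i, the action-gated pre-synaptic activity
    """
    new_c = c + (-c + dx_j * gated_pre) / tau_kappa
    new_kappa = kappa + (-kappa + gated_pre) / tau_kappa
    new_w = w + kappa * (c - w * kappa) / tau_w
    return new_w, new_c, new_kappa

# Run 1: the connection is used and the successor reliably becomes active.
w, c, kappa = 1.0, 0.0, 0.0
for _ in range(3000):
    w, c, kappa = hebb_step(w, c, kappa, dx_j=0.5, gated_pre=1.0)

# Run 2: the connection is used but the successor never responds (a blocked path).
w_blocked, c_b, kappa_b = 0.5, 0.5, 1.0
for _ in range(1000):
    w_blocked, c_b, kappa_b = hebb_step(w_blocked, c_b, kappa_b, dx_j=0.0, gated_pre=1.0)
```

In the first run the weight settles at the ratio of the two filters; in the second run the correlation filter decays and the weight is depressed towards zero, which is the kind of mechanism that lets the value field route around a blockade.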

5 Experiments 

To demonstrate the functionality of the CWM we consider a simple maze problem. The 
parameters we used are 

$\tau_x$   $\eta$   $2\sigma_s^2$   $2\sigma_a^2$   $\tau_v$   $\gamma$   $\tau_e$   $\tau_s$   $\tau_a$   $\tau_w$   $\tau_\kappa$
2          0.1      0.01            0.5             2          0.8        10         20         5          10         100

Figure 2a displays the geometry of the maze. The "agent" is allowed to move continuously in 
this maze. The motor signal is 2-dimensional and encodes the forces $f$ in x- and y-directions; 
the agent has a momentum and friction according to $\ddot{x} = 0.2\,(f - \dot{x})$. As a stimulus, the 
CWM is given the 2D position $x$. 

Figure 2a also displays the (lateral) topology of the central layer after 30 000 time steps of 
self-organization, after which the system becomes quasi-stationary. The model is learned 
from scratch, initialized with one random unit. During this first phase, behavior planning 
is switched off and the maze is explored with a random walk that changes its direction only 
with probability 0.1 at a time. In the illustration, the positions of the units correspond to 
the codebook vectors that have been learned. The directedness and the codebook vectors of 
the connections cannot be displayed. 
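
The exploration phase can be illustrated with a small sketch of the agent dynamics and the random walk. The Euler discretization with unit time step, the unit-length force, the omission of walls, and the names `agent_step` and `random_walk_force` are assumptions made for this illustration only.

```python
import math
import random

def agent_step(pos, vel, force, dt=1.0):
    """Euler step of x'' = 0.2 (f - x') for a 2-D point agent (walls omitted)."""
    vel = [v + dt * 0.2 * (f - v) for v, f in zip(vel, force)]
    pos = [p + dt * v for p, v in zip(pos, vel)]
    return pos, vel

def random_walk_force(force, rng, p_change=0.1):
    """Keep the current force, or redraw its direction with probability p_change."""
    if rng.random() < p_change:
        phi = rng.uniform(0.0, 2.0 * math.pi)
        return [math.cos(phi), math.sin(phi)]
    return force

# Under a constant force the velocity relaxes towards the force vector.
pos, vel = [0.0, 0.0], [0.0, 0.0]
for _ in range(100):
    pos, vel = agent_step(pos, vel, [1.0, 0.0])

# A short random-walk exploration with a seeded generator.
rng = random.Random(0)
force, p, u = [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]
for _ in range(50):
    force = random_walk_force(force, rng)
    p, u = agent_step(p, u, force)
```

Because the direction changes only occasionally, the walk covers longer straight segments than an uncorrelated random walk would, which suits exploring corridors.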

After the self-organization phase we switched on behavior planning. A goal stimulus 
corresponding to a random position in the maze is given and changed every time the agent 
reaches the goal. Generally, the agent has no problem finding a path to the goal. Figure 2b 
displays a more interesting example. The agent has reached goal A and now seeks 
goal B. However, we blocked passage 1. Starting at A the agent moves normally until 
it reaches the blockade. It stays there and moves slowly up and down in front of the blockade 
for a while; this while is of the order of the low-pass filter time scale $\tau_\kappa$. During this time, 
the lateral weights of the connections pointing to the left are depressed and after about 150 
time steps, this change of weights has enough influence on the value field dynamics (4) to 
let the agent choose the way around the bottom to goal B. Figure 2c displays the next scene: 
Starting at B, the agent tries to reach goal C again via blockade 1 (the previous 
adaptation depressed only the connections from right to left). Again, it reaches the blockade, stays 
there for a while, and then takes the way around to goal C. Figures 2d and 2e repeat this 
experiment with blockade 2. Starting at D, the agent reaches blockade 2 and eventually 
chooses the way around to goal E. Then, seeking goal F, the agent reaches the blockade 
first from the left, thereafter from the bottom, then from the right, then it tries from the 
bottom again, and finally learns that none of these paths is valid anymore and chooses 
the way all around to goal F. Figure 2f shows that, once the world model has re-adapted 
to account for these blockades, the agent will not forget about them: Here, moving from G 
to H, it does not try to pass through blockade 2. 

The reader is encouraged to also refer to the movies of the experiments, deposited at 
www.marc-toussaint.net/03-cwm/, which visualize much better the dynamics of 
self-organization, the planning behavior, the dynamics of the value field, and the world model 
readaptation. 



6 Discussion 

The model we proposed is a connectionist architecture that can represent and learn the 
relation between motor signals and perception. The model is continuous in the action, perception, 
and time domain and does not presume any a priori knowledge about the motor system. A 
dynamical value field on the learned world model organizes behavior planning, a method in 
principle borrowed from classical Value Iteration. A major feature of our model is its 
adaptability. The state space model is developed in a self-organizing way and small world changes 
require only little re-adaptation of the CWM. Generally speaking, the model is a highly 
functional system based on only local mechanisms. It demonstrates what these mechanisms 
can accomplish when embedded in a suitable structure, e.g., the concrete functionality of a 
temporal Hebb rule w.r.t. behavior planning in our model, or the functionality of synapse 
modulation as we employed it. 

Figure 2: The CWM on a maze problem: (a) the outcome of self-organization; (b-c) agent 
movements from goal A to B to C; here, passage 1 was blocked, which requires readaptation 
of the world model; (d-f) agent movements that demonstrate adaptation to a second 
blockade. Please see the text for more explanations. 

Future work will include the more rigorous probabilistic interpretations of CWMs which we 
already indicated in section 2. Another, rather straightforward extension will be to replace 
random-walk exploration by more directed, information-seeking exploration methods as they 
have already been developed for classical world models [18, 19, 20]. A deeper and still open 
question is the matter of distributed representations: Can a procedure analogous to Value 
Iteration be generalized to multi-modal representations or representations that are composed 
of two layers, each corresponding to another "perceptual dimension"? The binding problem 
is touched here and it seems that the value dynamics will have to be sequential rather than 
parallel. 